Methods for Frequent Sequence Mining with Subsequence Constraints
Abstract
In this thesis, we study scalable and general-purpose methods for mining frequent sequences that satisfy a given subsequence constraint. Frequent sequence mining is a fundamental task in data mining and has many real-life applications, such as information extraction, market-basket analysis, web usage mining, and session analysis. Depending on the underlying application, we are generally interested in discovering only certain frequent sequences, which are described using subsequence constraints. Many tools and algorithms exist for this task; however, they are not sufficiently scalable to deal with the large amounts of data that may arise in applications, and they are generally not extensible across a range of applications. We propose scalable, distributed sequence mining algorithms that target MapReduce. Our work builds on MG-FSM, a distributed framework for frequent sequence mining. We propose novel algorithms that improve and extend the basic MG-FSM framework to efficiently support the traditional subsequence constraints that arise in applications. Additionally, we show that many subsequence constraints, including and beyond the traditional ones considered in the literature, can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. To this end, we propose a general-purpose framework that provides a set of simple and intuitive "pattern expressions", which allow users to describe any subsequence constraint of interest, and we explore algorithms for efficiently mining frequent subsequences under such general constraints. Our experimental study on real-world datasets indicates that our proposed algorithms are scalable and effective across a wide range of applications.
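To make the task concrete, the following is a minimal single-machine sketch of frequent sequence mining under two traditional subsequence constraints, a maximum gap and a maximum length. It is a naive enumerate-and-count illustration, not the MG-FSM algorithm or the pattern-expression framework of the thesis; the function names (`constrained_subsequences`, `mine_frequent`) and parameters (`max_gap`, `max_len`, `min_support`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def constrained_subsequences(seq, max_gap, max_len):
    """Return the set of subsequences of `seq` with 2..max_len items in which
    at most `max_gap` items are skipped between consecutive chosen items."""
    found = set()

    def extend(prefix, last_pos):
        if len(prefix) >= 2:
            found.add(prefix)
        if len(prefix) >= max_len:
            return
        # The next item may sit at most max_gap + 1 positions after the last one.
        stop = min(len(seq), last_pos + max_gap + 2)
        for pos in range(last_pos + 1, stop):
            extend(prefix + (seq[pos],), pos)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine_frequent(database, min_support, max_gap, max_len):
    """Count, for every constrained subsequence, the number of input sequences
    that contain it, and keep those reaching `min_support` (assumed definition
    of support: one count per input sequence)."""
    support = Counter()
    for seq in database:
        # A set is used so each input sequence contributes at most 1 per pattern.
        support.update(constrained_subsequences(tuple(seq), max_gap, max_len))
    return {p: c for p, c in support.items() if c >= min_support}


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c"], ["a", "x", "y", "c"]]
    print(mine_frequent(db, min_support=2, max_gap=1, max_len=2))
```

In this toy database, the pair `("a", "c")` is supported by the first two sequences when one skipped item is allowed, but not by the third, where two items separate `a` and `c`. Enumerating every candidate per sequence grows quickly with `max_len` and `max_gap`, which is the kind of cost that motivates scalable, distributed approaches.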
Similar resources
Efficiently Mining Closed Subsequences with Gap Constraints
Mining frequent subsequence patterns from sequence databases is a typical data mining problem, and various efficient sequential pattern mining algorithms have been proposed. In many problem domains (e.g., biology), the frequent subsequences confined by predefined gap requirements are more meaningful than general sequential patterns. In this paper we re-examine the closed sequential patter...
Mining Frequent Graph Sequence Patterns Induced by Vertices
The mining of a complete set of frequent subgraphs from labeled graph data has been studied extensively. Furthermore, much attention has recently been paid to frequent pattern mining from graph sequences (dynamic graphs or evolving graphs). In this paper, we define a novel class of subgraph subsequence called an “induced subgraph subsequence” to enable efficient mining of a complete set of freq...
Algorithms for Computing Variants of the Longest Common Subsequence Problem (Extended
The longest common subsequence (LCS) problem is one of the classical and well-studied problems in computer science. The computation of the LCS is a frequent task in DNA sequence analysis, and has applications to genetics and molecular biology. In this paper we define new variants, introducing the notion of gap constraints in the LCS problem, and present efficient algorithms to solve them. The new vari...
Survey of Sequential Pattern Mining Algorithms and an Extension to Time Interval Based Mining Algorithm
Sequential pattern mining finds the subsequence and frequent relevant patterns from the given sequences. Sequential pattern mining is used in various domains such as medical treatments, natural disasters, customer shopping sequences, DNA sequences and gene structures. Various sequential pattern mining algorithms such as GSP, SPADE, SPAM and PrefixSpan have been proposed for finding the relevant...
Efficient Identification of Common Subsequences from Big Data Streams Using Sliding Window Technique
We propose an efficient Frequent Sequence Stream algorithm for identifying the top k most frequent subsequences over big data streams. Our Sequence Stream algorithm owes its efficiency to its linear time complexity and very limited space complexity. With a pre-specified subsequence window size S and the k value, with very high probability, the Sequence Stream algorithm retrieves the top...